ABSTRACT
Large 0--1 datasets arise in various applications, such as market basket analysis and information retrieval. We concentrate on the study of topic models, aiming at results which indicate why certain methods succeed or fail. We describe simple algorithms for finding topic models from 0--1 data. We give theoretical results showing that the algorithms can discover the epsilon-separable topic models of Papadimitriou et al. We present empirical results showing that the algorithms find natural topics in real-world data sets. We also briefly discuss the connections to matrix approaches, including nonnegative matrix factorization and independent component analysis.
References
- R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD '93, pages 207--216, 1993.
- R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 12, pages 307--328. AAAI Press, 1996.
- A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39--71, 1996.
- I. V. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction. In KDD 2001, pages 37--46, San Francisco, CA, Aug. 2001.
- M. A. Carreira-Perpinan and S. Renals. Practical identifiability of finite mixtures of multivariate Bernoulli distributions. Neural Computation, 12:141--152, 2000.
- P. Comon. Independent component analysis --- a new concept? Signal Processing, 36:287--314, 1994.
- G. Das, H. Mannila, and P. Ronkainen. Similarity of attributes by external probes. In Knowledge Discovery and Data Mining, pages 23--29, 1998.
- S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391--407, 1990.
- S. Della Pietra, V. J. Della Pietra, and J. D. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380--393, 1997.
- M. Gyllenberg, T. Koski, E. Reilink, and M. Verlaan. Non-uniqueness in probabilistic numerical identification of bacteria. Journal of Applied Probability, 31:542--548, 1994.
- T. Hofmann. Probabilistic latent semantic indexing. In SIGIR '99, pages 50--57, Berkeley, CA, 1999.
- A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.
- D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788--791, Oct. 1999.
- C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: A probabilistic analysis. In PODS '98, pages 159--168, June 1998.
- D. Pavlov, H. Mannila, and P. Smyth. Probabilistic models for query approximation with large sparse binary datasets. In UAI-2000, 2000.
- D. Pavlov and P. Smyth. Probabilistic query models for transaction data. In KDD 2001, 2001.
- J. W. Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 18(5):401--409, May 1969.
Index Terms
- Topics in 0--1 data